Fix performance regression for prefix caching #4270
Conversation
Pull request overview
This PR fixes a performance regression in prefix caching introduced by #4223 that caused increased cache misses when sending long and repeated requests in bursts. The fix enables caching and matching of prefix blocks from partially prefilled sequences.
Changes:
- Modified CachePrompt and PrefixMatch functions to allow caching and matching of blocks from partially prefilled sequences (where cache_len < prompt.size()); see the sketch after this list
- Changed Materialize return value attribute from [[nodiscard]] to [[maybe_unused]] since the return value is not always used
- Replaced FT_CHECK macros with TM_CHECK variants for consistency
- Refactored code formatting and variable initialization in several files
- Added async_ parameter to benchmark script
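The core behavioral change can be illustrated with a minimal standalone sketch. Everything below (the Seq struct, kBlockSeqLen, CacheablePrefixTokens) is hypothetical and simplified, not the TurboMind API; it only shows the idea that a partially prefilled prompt can still contribute its fully filled blocks to the prefix cache.

```cpp
// Minimal sketch (not the actual TurboMind code) of the idea behind the fix:
// even when a sequence is only partially prefilled (cache_len < prompt.size()),
// the blocks that are already fully filled can be cached and matched as a prefix.
#include <algorithm>
#include <iostream>
#include <vector>

constexpr int kBlockSeqLen = 4;  // tokens per KV block (small value for illustration)

struct Seq {
    std::vector<int> prompt;  // full prompt token ids
    int cache_len = 0;        // number of prompt tokens already prefilled into KV blocks
};

// Number of prompt tokens living in fully filled blocks, i.e. safe to publish to a
// prefix cache even though the sequence is not fully prefilled yet.
int CacheablePrefixTokens(const Seq& seq) {
    const int valid = std::min<int>(seq.cache_len, static_cast<int>(seq.prompt.size()));
    return valid / kBlockSeqLen * kBlockSeqLen;  // round down to a whole-block boundary
}

int main() {
    Seq seq;
    seq.prompt.assign(10, 7);  // 10-token prompt
    seq.cache_len = 6;         // only 6 tokens prefilled so far (partially prefilled)

    // A guard like `cache_len >= prompt.size()` would skip caching here entirely;
    // relaxing it allows publishing the first full block (4 tokens) right away.
    std::cout << "cacheable prefix tokens: " << CacheablePrefixTokens(seq) << "\n";  // 4
}
```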
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.
Summary per file
| File | Description |
|---|---|
| src/turbomind/models/llama/SequenceManager.h | Changed Materialize attribute from [[nodiscard]] to [[maybe_unused]] |
| src/turbomind/models/llama/SequenceManager.cc | Enabled caching/matching for partially prefilled sequences; replaced FT_CHECK with TM_CHECK; cleaned up formatting |
| src/turbomind/models/llama/BlockTrie.cc | Refactored Match and Cache functions with clearer variable names and improved control flow |
| src/turbomind/models/llama/BlockManager.h | Removed unused function declaration |
| src/turbomind/models/llama/BlockManager.cc | Replaced FT_CHECK with TM_CHECK; removed unused function; added timestamp initialization |
| src/turbomind/engine/engine.cc | Removed unused variables; removed unnecessary lambda captures |
| benchmark/profile_pipeline_api.py | Added async_ parameter support |
```cpp
BlockIds block_ids;
UniqueIds block_unique_ids;
std::tie(block_ids, block_unique_ids) = block_trie_->Cache(seq, seq.prompt);
if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
```
Copilot AI (Jan 15, 2026)
The condition !seq.prompt.empty() should also check if there are blocks available to cache. When seq.cache_len < seq.prompt.size() but seq.blocks is empty or insufficient, attempting to cache could lead to unexpected behavior. Consider adding a check for seq.blocks.size() to ensure blocks exist before caching.
```diff
- if (auto& seq = *sequences[i]; !seq.prompt.empty()) {
+ if (auto& seq = *sequences[i]; !seq.prompt.empty() && !seq.blocks.empty()) {
```
```cpp
}

/// TODO: `Unlock` and `Lock` can't be batched because there may be repeated blocks between sequences
if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
```
Copilot AI (Jan 15, 2026)
The offset calculation using integer division could lead to data loss if seq.cache_len is not perfectly aligned with block_seq_len_. Consider adding a check or comment to clarify the expected behavior when seq.cache_len % block_seq_len_ != 0.
```diff
- if (const int offset = seq.cache_len / block_seq_len_; offset < block_ids.size()) {
+ // NOTE: seq.cache_len is expected to be aligned to block_seq_len_. The integer division below
+ // intentionally converts from cached tokens to a whole-block offset; any partial block (non-zero
+ // remainder) is ignored here and will be effectively truncated.
+ const int offset = seq.cache_len / block_seq_len_;
+ if (seq.cache_len % block_seq_len_ != 0 && rank_ == 0) {
+     TM_LOG_DEBUG(
+         "[SeqMgr][match] ID %llu, cache_len %d not aligned to block_seq_len_ %d; truncating to %d blocks",
+         seq.id, seq.cache_len, block_seq_len_, offset);
+ }
+ if (offset < static_cast<int>(block_ids.size())) {
```
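For a concrete feel of the arithmetic discussed in this comment, here is a tiny standalone example; the values and variable names are illustrative only, not taken from the repository.

```cpp
// Standalone illustration of the whole-block offset computation discussed above.
// With block_seq_len_ = 64 and cache_len = 130, the division yields offset = 2:
// two full blocks are skipped when matching, and the 2 tokens sitting in the
// partially filled third block are ignored (truncated) by design.
#include <cassert>

int main() {
    const int block_seq_len_ = 64;
    const int cache_len      = 130;
    const int offset         = cache_len / block_seq_len_;  // integer division -> 2
    assert(offset == 2);
    assert(cache_len % block_seq_len_ == 2);  // the non-aligned remainder that gets dropped
}
```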
```cpp
b.use_count = 1;
b.unique_id = unique_id_++;
FT_CHECK(is_active(b)); // post-condition
b.timestamp = timestamp_++;
```
Copilot AI (Jan 15, 2026)
The timestamp is now being set during Allocate, but this appears to be a new addition. The timestamp should logically be set when blocks are allocated, but verify this doesn't cause issues with LRU eviction logic, especially if blocks are allocated but not immediately used.
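To make the concern concrete, here is a hedged standalone sketch of timestamp-based LRU victim selection. It is not the TurboMind BlockManager; it only assumes eviction prefers the unused block with the smallest timestamp, which is why leaving a freshly allocated block's timestamp at its default could make it look like the oldest candidate.

```cpp
// Hedged sketch: why stamping `timestamp` at allocation can matter for LRU-style eviction.
// Assumption (not taken from the repository): eviction picks the unused block with the
// smallest (oldest) timestamp.
#include <cstdint>
#include <iostream>
#include <vector>

struct Block {
    int      id        = -1;
    int      use_count = 0;  // > 0 means actively held by a sequence
    uint64_t timestamp = 0;  // last-touched counter; 0 looks "infinitely old" to LRU
};

// Pick the evictable block that was touched least recently (smallest timestamp).
const Block* PickVictim(const std::vector<Block>& blocks) {
    const Block* victim = nullptr;
    for (const auto& b : blocks) {
        if (b.use_count == 0 && (victim == nullptr || b.timestamp < victim->timestamp)) {
            victim = &b;
        }
    }
    return victim;
}

int main() {
    uint64_t clock = 1;
    std::vector<Block> blocks(3);
    blocks[0] = {0, 0, clock++};  // old cached block
    blocks[1] = {1, 0, clock++};  // newer cached block
    blocks[2] = {2, 0, 0};        // just allocated then released, timestamp never set

    // Without stamping at allocation, the brand-new block 2 is evicted first;
    // stamping it (as in `b.timestamp = timestamp_++` above) would instead
    // evict the genuinely oldest block 0.
    std::cout << "victim: " << PickVictim(blocks)->id << "\n";  // prints 2
}
```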
In #4223, the change in scheduling policy caused more cache misses when sending long and repeated requests in a burst. This PR fixes it by enabling the caching and matching of prefix blocks from partially prefilled sequences.